Load GraphLab Create


In [2]:
import graphlab

Basic settings


In [3]:
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 8)



In [4]:
# Set Canvas to render visualizations inline in the notebook
graphlab.canvas.set_target('ipynb')

Load the people data


In [5]:
people = graphlab.SFrame('people_wiki.gl/')

In [6]:
people.head()


Out[6]:
+-------------------------------------------------------+---------------------+--------------------------------------------------------+
| URI                                                   | name                | text                                                   |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+
| <http://dbpedia.org/resource/Digby_Morrell> ...       | Digby Morrell       | digby morrell born 10 october 1979 is a former ...     |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ...      | Alfred J. Lewy      | alfred j lewy aka sandy lewy graduated from ...        |
| <http://dbpedia.org/resource/Harpdog_Brown> ...       | Harpdog Brown       | harpdog brown is a singer and harmonica player who ... |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ...    |
| <http://dbpedia.org/resource/G-Enka> ...              | G-Enka              | henry krvits born 30 december 1974 in tallinn ...      |
| <http://dbpedia.org/resource/Sam_Henderson> ...       | Sam Henderson       | sam henderson born october 18 1969 is an ...           |
| <http://dbpedia.org/resource/Aaron_LaCrate> ...       | Aaron LaCrate       | aaron lacrate is an american music producer ...        |
| <http://dbpedia.org/resource/Trevor_Ferguson> ...     | Trevor Ferguson     | trevor ferguson aka john farrow born 11 november ...   |
| <http://dbpedia.org/resource/Grant_Nelson> ...        | Grant Nelson        | grant nelson born 27 april 1971 in london ...          |
| <http://dbpedia.org/resource/Cathy_Caruth> ...        | Cathy Caruth        | cathy caruth born 1955 is frank h t rhodes ...         |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+
[10 rows x 3 columns]


In [7]:
people['word_count'] = graphlab.text_analytics.count_words(people['text'])

In [8]:
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])

In [9]:
people['tfidf'] = tfidf

In [10]:
people.head()


Out[10]:
+-------------------------------------------------------+---------------------+--------------------------------------------------------+-----------------------------------------------------+
| URI                                                   | name                | text                                                   | word_count                                          |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+-----------------------------------------------------+
| <http://dbpedia.org/resource/Digby_Morrell> ...       | Digby Morrell       | digby morrell born 10 october 1979 is a former ...     | {'selection': 1, 'carltons': 1, 'being': ...        |
| <http://dbpedia.org/resource/Alfred_J._Lewy> ...      | Alfred J. Lewy      | alfred j lewy aka sandy lewy graduated from ...        | {'precise': 1, 'thomas': 1, 'closely': 1, ...       |
| <http://dbpedia.org/resource/Harpdog_Brown> ...       | Harpdog Brown       | harpdog brown is a singer and harmonica player who ... | {'just': 1, 'issued': 1, 'mainly': 1, 'nominat ...  |
| <http://dbpedia.org/resource/Franz_Rottensteiner> ... | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lower ...    | {'all': 1, 'bauforschung': 1, ...                   |
| <http://dbpedia.org/resource/G-Enka> ...              | G-Enka              | henry krvits born 30 december 1974 in tallinn ...      | {'they': 1, 'gangstergenka': 1, ...                 |
| <http://dbpedia.org/resource/Sam_Henderson> ...       | Sam Henderson       | sam henderson born october 18 1969 is an ...           | {'currently': 1, 'less': 1, 'being': 1, ...         |
| <http://dbpedia.org/resource/Aaron_LaCrate> ...       | Aaron LaCrate       | aaron lacrate is an american music producer ...        | {'exclusive': 2, 'producer': 1, 'show' ...          |
| <http://dbpedia.org/resource/Trevor_Ferguson> ...     | Trevor Ferguson     | trevor ferguson aka john farrow born 11 november ...   | {'taxi': 1, 'salon': 1, 'gangs': 1, 'being': 1, ... |
| <http://dbpedia.org/resource/Grant_Nelson> ...        | Grant Nelson        | grant nelson born 27 april 1971 in london ...          | {'houston': 1, 'frankie': 1, 'labels': 1, ...       |
| <http://dbpedia.org/resource/Cathy_Caruth> ...        | Cathy Caruth        | cathy caruth born 1955 is frank h t rhodes ...         | {'phenomenon': 1, 'deborash': 1, 'both' ...         |
+-------------------------------------------------------+---------------------+--------------------------------------------------------+-----------------------------------------------------+
+---------------------------------------+
| tfidf                                 |
+---------------------------------------+
| {'selection': 3.836578553093086, ...  |
| {'precise': 6.44320060695519, ...     |
| {'just': 2.7007299687108643, ...      |
| {'all': 1.6431112434912472, ...       |
| {'they': 1.8993401178193898, ...      |
| {'currently': 1.637088969126014, ...  |
| {'exclusive': 10.455187230695827, ... |
| {'taxi': 6.0520214560945025, ...      |
| {'houston': 3.935505942157149, ...    |
| {'phenomenon': 5.750053426395245, ... |
+---------------------------------------+
[10 rows x 5 columns]

Assignments

1. Compare top words according to word counts to TF-IDF


In [11]:
elton = people[people['name'] == 'Elton John']

In [12]:
elton.head()


Out[12]:
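+----------------------------------------------+------------+---------------------------------------------------+--------------------------------------------------+
| URI                                          | name       | text                                              | word_count                                       |
+----------------------------------------------+------------+---------------------------------------------------+--------------------------------------------------+
| <http://dbpedia.org/resource/Elton_John> ... | Elton John | sir elton hercules john cbe born reginald ken ... | {'all': 1, 'least': 1, 'producer': 1, 'heavi ... |
+----------------------------------------------+------------+---------------------------------------------------+--------------------------------------------------+
+---------------------------------+
| tfidf                           |
+---------------------------------+
| {'all': 1.6431112434912472, ... |
+---------------------------------+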
[1 rows x 5 columns]

What are the 3 words in his article with the highest word counts? What are the 3 words in his article with the highest TF-IDF?

These results illustrate why TF-IDF is useful for finding important words. Save these results to answer the quiz at the end.


In [13]:
elton[['tfidf']].stack('tfidf', new_column_name=['word','tfidf']).sort('tfidf', ascending=False)


Out[13]:
word tfidf
furnish 18.38947184
elton 17.48232027
billboard 17.3036809575
john 13.9393127924
songwriters 11.250406447
tonightcandle 10.9864953892
overallelton 10.9864953892
19702000 10.2933482087
fivedecade 10.2933482087
aids 10.262846934
[255 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [14]:
elton[['word_count']].stack('word_count', new_column_name=['word','count']).sort('count', ascending=False)


Out[14]:
word count
the 27
in 18
and 15
of 13
a 10
has 9
john 7
he 7
on 6
award 5
[255 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
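
The two rankings above differ because TF-IDF discounts words that occur in most documents. The following is a minimal pure-Python sketch of that idea, not GraphLab Create's implementation. It assumes the common weighting count × log(N / document frequency); the helper names and the toy corpus are hypothetical and are not taken from people_wiki.gl.

```python
import math
from collections import Counter


def count_words(text):
    """Bag-of-words counts for one whitespace-tokenized document."""
    return Counter(text.split())


def tf_idf(word_counts):
    """Weight each count by log(N / number of documents containing the word)."""
    n_docs = len(word_counts)
    doc_freq = Counter()
    for counts in word_counts:
        doc_freq.update(counts.keys())
    return [
        {w: c * math.log(float(n_docs) / doc_freq[w]) for w, c in counts.items()}
        for counts in word_counts
    ]


# Hypothetical toy corpus, chosen only to illustrate the effect.
docs = [
    "the singer toured and the singer recorded the album",
    "the striker scored the goal",
    "the author wrote the novel",
]
counts = [count_words(d) for d in docs]
weights = tf_idf(counts)

# Highest raw count vs. highest TF-IDF weight in the first document.
top_by_count = max(counts[0], key=counts[0].get)
top_by_tfidf = max(weights[0], key=weights[0].get)
```

In this toy corpus "the" has the highest raw count but appears in every document, so its weight is log(1) = 0, while "singer" rises to the top. The same effect pushes "the", "in" and "and" out of the TF-IDF ranking for Elton John.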

2. Measuring distance

What’s the cosine distance between the articles on ‘Elton John’ and ‘Victoria Beckham’? What’s the cosine distance between the articles on ‘Elton John’ and ‘Paul McCartney’? Which of the two is closer to Elton John? Does this result make sense to you? Save these results to answer the quiz at the end.

Victoria Beckham


In [15]:
victoria = people[people['name'] == 'Victoria Beckham']

In [16]:
graphlab.distances.cosine(elton['tfidf'][0], victoria['tfidf'][0])


Out[16]:
0.9567006376655429

Paul McCartney


In [17]:
paul = people[people['name'] == 'Paul McCartney']

In [18]:
graphlab.distances.cosine(elton['tfidf'][0], paul['tfidf'][0])


Out[18]:
0.8250310029221779
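
For reference, cosine distance is 1 minus the cosine similarity of the two vectors, so a smaller value means more similar articles. Below is a minimal sketch for sparse {word: weight} dictionaries like the ones in the tfidf column. The function is a hypothetical stand-in written for illustration, not the code behind graphlab.distances.cosine, and the example weights are made up.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse {word: weight} dicts.

    Assumes neither vector is all zeros (otherwise the norm product is 0).
    """
    # Only words present in both dicts contribute to the dot product.
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical weights, not taken from the dataset.
x = {'piano': 3.0, 'album': 4.0}
y = {'album': 4.0, 'fashion': 3.0}
z = {'football': 5.0}

d_xy = cosine_distance(x, y)  # shared word 'album': 1 - 16/25
d_xz = cosine_distance(x, z)  # no shared words: distance is 1
```

With non-negative weights the distance stays between 0 (same direction) and 1 (no words in common), which is why 0.825 for Paul McCartney counts as closer than 0.957 for Victoria Beckham.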

Building nearest neighbors models with different input features and setting the distance metric


In [19]:
cosine_tfidf_model = graphlab.nearest_neighbors.create(people, features=['tfidf'], label='name', distance='cosine')


Starting brute force nearest neighbors model training.

In [20]:
cosine_word_count_model = graphlab.nearest_neighbors.create(people, features=['word_count'], label='name', distance='cosine')


Starting brute force nearest neighbors model training.
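
The training log mentions brute force search. Conceptually, that means comparing the query against every reference document and sorting by distance. The sketch below illustrates this under stated assumptions: the `query` helper, the labels and the weights are hypothetical, and this is not GraphLab Create's implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse {word: weight} dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def query(reference, target_label, k=5):
    """Rank all reference documents by distance to the target (brute force)."""
    target = reference[target_label]
    ranked = sorted(
        (cosine_distance(target, vec), label) for label, vec in reference.items()
    )
    # Return (label, distance, rank) tuples for the k closest documents.
    return [(label, dist, rank) for rank, (dist, label) in enumerate(ranked[:k], 1)]


# Hypothetical reference set; labels and weights are made up.
reference = {
    'doc_a': {'piano': 3.0, 'album': 4.0},
    'doc_b': {'album': 4.0, 'tour': 3.0},
    'doc_c': {'goal': 5.0},
}
result = query(reference, 'doc_a', k=3)
```

As in the query outputs in this notebook, the query document itself comes back at rank 1 with a distance of essentially zero. Values such as 2.22044604925e-16 or -2.22044604925e-16 are floating-point rounding, so the answer to "most similar, other than itself" is the row at rank 2.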

Now we are ready to use our models to retrieve documents. Use these two models to collect the following results:

What’s the most similar article, other than itself, to the one on ‘Elton John’ using word count features?

Save these results to answer the quiz at the end.


In [21]:
cosine_word_count_model.query(elton)


Starting pairwise querying.
+--------------+---------+-------------+--------------+
| Query points | # Pairs | % Complete. | Elapsed Time |
+--------------+---------+-------------+--------------+
| 0            | 1       | 0.00169288  | 31.093ms     |
| Done         |         | 100         | 274.924ms    |
+--------------+---------+-------------+--------------+
Out[21]:
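+-------------+-------------------+-------------------+------+
| query_label | reference_label   | distance          | rank |
+-------------+-------------------+-------------------+------+
| 0           | Elton John        | 2.22044604925e-16 | 1    |
| 0           | Cliff Richard     | 0.16142415259     | 2    |
| 0           | Sandro Petrone    | 0.16822542751     | 3    |
| 0           | Rod Stewart       | 0.168327165587    | 4    |
| 0           | Malachi O'Doherty | 0.177315545979    | 5    |
+-------------+-------------------+-------------------+------+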
[5 rows x 4 columns]

What’s the most similar article, other than itself, to the one on ‘Elton John’ using TF-IDF features?


In [22]:
cosine_tfidf_model.query(elton)


Starting pairwise querying.
+--------------+---------+-------------+--------------+
| Query points | # Pairs | % Complete. | Elapsed Time |
+--------------+---------+-------------+--------------+
| 0            | 1       | 0.00169288  | 14.546ms     |
| Done         |         | 100         | 269.637ms    |
+--------------+---------+-------------+--------------+
Out[22]:
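+-------------+------------------+--------------------+------+
| query_label | reference_label  | distance           | rank |
+-------------+------------------+--------------------+------+
| 0           | Elton John       | -2.22044604925e-16 | 1    |
| 0           | Rod Stewart      | 0.717219667893     | 2    |
| 0           | George Michael   | 0.747600998969     | 3    |
| 0           | Sting (musician) | 0.747671954431     | 4    |
| 0           | Phil Collins     | 0.75119324879      | 5    |
+-------------+------------------+--------------------+------+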
[5 rows x 4 columns]

What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using word count features?


In [23]:
cosine_word_count_model.query(victoria)


Starting pairwise querying.
+--------------+---------+-------------+--------------+
| Query points | # Pairs | % Complete. | Elapsed Time |
+--------------+---------+-------------+--------------+
| 0            | 1       | 0.00169288  | 77.827ms     |
| Done         |         | 100         | 310.926ms    |
+--------------+---------+-------------+--------------+
Out[23]:
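+-------------+--------------------------+--------------------+------+
| query_label | reference_label          | distance           | rank |
+-------------+--------------------------+--------------------+------+
| 0           | Victoria Beckham         | -2.22044604925e-16 | 1    |
| 0           | Mary Fitzgerald (artist) | 0.207307036115     | 2    |
| 0           | Adrienne Corri           | 0.214509782788     | 3    |
| 0           | Beverly Jane Fry         | 0.217466468741     | 4    |
| 0           | Raman Mundair            | 0.217695474992     | 5    |
+-------------+--------------------------+--------------------+------+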
[5 rows x 4 columns]

What’s the most similar article, other than itself, to the one on ‘Victoria Beckham’ using TF-IDF features?


In [24]:
cosine_tfidf_model.query(victoria)


Starting pairwise querying.
+--------------+---------+-------------+--------------+
| Query points | # Pairs | % Complete. | Elapsed Time |
+--------------+---------+-------------+--------------+
| 0            | 1       | 0.00169288  | 11.17ms      |
| Done         |         | 100         | 274.287ms    |
+--------------+---------+-------------+--------------+
Out[24]:
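+-------------+---------------------+-------------------+------+
| query_label | reference_label     | distance          | rank |
+-------------+---------------------+-------------------+------+
| 0           | Victoria Beckham    | 1.11022302463e-16 | 1    |
| 0           | David Beckham       | 0.548169610263    | 2    |
| 0           | Stephen Dow Beckham | 0.784986706828    | 3    |
| 0           | Mel B               | 0.809585523409    | 4    |
| 0           | Caroline Rush       | 0.819826422919    | 5    |
+-------------+---------------------+-------------------+------+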
[5 rows x 4 columns]

That's all folks!

